Recognizing Noisy Romanized Japanese Words in Learner English

Authors

  • Ryo Nagata
  • Jun-ichi Kakegawa
  • Hiromi Sugimoto
  • Yukiko Yabuta
Abstract

This paper describes a method for recognizing romanized Japanese words in learner English. Because most of these words are unknown words, they act as noise and cause problems in a variety of tasks, including part-of-speech tagging, spell checking, and error detection. A difficulty in recognizing romanized Japanese words in learner English is that spelling errors often violate the spelling rules of romanized Japanese. To address this problem, the described method uses a clustering algorithm reinforced by a small set of rules. Experiments show that it achieves an F-measure of 0.879 and outperforms other methods. They also show that it requires only the target text and a fair-sized English word list.
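To make the problem concrete, the following is a minimal sketch of a purely rule-based baseline. It is not the clustering method the paper describes. It assumes a simplified mora pattern for romanized Japanese (an optional onset followed by a vowel, or a standalone "n", with no handling of doubled consonants or long vowels). The names ONSET, MORA, looks_like_romaji, and candidate_romaji are hypothetical and chosen only for illustration.

```python
import re

# Simplified, assumed inventory of romanized Japanese onsets (not from the paper).
ONSET = r"(?:ky|gy|sh|ch|ts|ny|hy|by|py|my|ry|[kgsztdnhbpmyrwjf])"
# One mora: optional onset plus a vowel, or a standalone moraic "n".
MORA = rf"(?:{ONSET}?[aiueo]|n)"
ROMAJI_RE = re.compile(rf"(?:{MORA})+")


def looks_like_romaji(token: str) -> bool:
    """Return True if the whole token fits the simplified mora pattern."""
    return ROMAJI_RE.fullmatch(token.lower()) is not None


def candidate_romaji(tokens, english_words):
    """Keep tokens that are absent from the English word list and fit the pattern."""
    return [
        t for t in tokens
        if t.lower() not in english_words and looks_like_romaji(t)
    ]


if __name__ == "__main__":
    english = {"we", "eating", "in"}  # stand-in for a real English word list
    tokens = ["We", "eating", "shshi", "sushi", "in", "Kyoto"]
    print(candidate_romaji(tokens, english))
```

Under these assumptions the correctly spelled "sushi" fits the pattern, while the misspelling "shshi" does not. This is the failure mode the abstract points to: spelling errors break the spelling rules, so a rules-only filter misses noisy forms.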

Similar resources

A Method for Recognizing Noisy Romanized Japanese Words in Learner English

†Konan University, ‡Hyogo University of Teacher Education, *Japan Institute for Educational Measurement, Inc. We eating shshi (sushi)ski. I read a book. I play baseball everyday. And My brother play baseball to. “Becose, Let‘s play baseball in Japan.” School trip is very fun. “But, it’s very busy.” Becaes I like school trip. I like foodamathya (green tea) Romanized Japanese...

Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?

This article offers an empirical study on the different ways of encoding Chinese, Japanese, Korean (CJK) and English languages for text classification. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized words. For all encoding levels, whenever applicable, we provide comparisons with linear models, fastText (Joulin et al., 2016) an...

Finding Romanized Arabic Dialect in Code-Mixed Tweets

Recent computational work on Arabic dialect identification has focused primarily on building and annotating corpora written in Arabic script. Arabic dialects however also appear written in Roman script, especially in social media. This paper describes our recent work developing tweet corpora and a token-level classifier that identifies a romanized Arabic dialect and distinguishes it from French...

Non-native English speech recognition using bilingual English lexicon and acoustic models

This paper proposes an English speech recognition system which can recognize both non-native (i.e. Japanese) and native English speakers’ pronunciation of English speech. The system uses a bilingual pronunciation lexicon in which each word has both English and Japanese phoneme transcriptions. The Japanese transcription is constructed considering typical Japanese pronunciation of English. Japane...

Hedges in English for Academic Purposes: A Corpus-based study of Iranian EFL learners

Hedges, as tools to express tentativeness and doubt, have been studied in plenty of research papers in the Iranian EFL research setting. However, their use in a learner corpus, portraying Iranian learner English, is in need of more research attention. With this end in view, this study aimed at investigating how Iranian EFL learners who have majored in English-related fields in Iran deployed hed...

Publication date: 2008